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Abstract. We describe a new method to map the requested error tolerance on an ff-matrix 
approximation to the block error tolerances. Numerical experiments show that the method produces 
more efficient approximations than the standard method for kernels having singularity order greater 
than one, often by factors of 1.5 to 5 and at a lower computational cost. 
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1. Introduction. A completely dense matrix B arising from an integral or sum 
involving a singular kernel over a surface or particles can be approximated efficiently 
by a hierarchical matrix, called an H-matrix [5] . Let B be the H- matrix approximation 
to B. 

Let the surface be discretized by N elements or let there be N particles. The two 
cases differ essentially only in how a matrix entry of B is calculated. If B includes all 
and only pair- wise interactions, then B G M. NxN . In this paper B is always square, 
but the results extend to the rectangle case B G R MxN by replacing occurrences of 
N 2 by MN. The procedure to construct an _ff-matrix has four parts. First, a cluster 
tree over the particles is formed. The cluster tree induces a (nonuniquc) symmetric 
permutation of B. For notational brevity, hereafter we assume B is already ordered 
such that the identity matrix is a permutation matrix induced by the cluster tree. 
Second, pairs of clusters are found that satisfy certain criteria; associated with such 
a pair is a block of B. Third, the requested error tolerance e (hereafter usually just 
tolerance) is mapped to block tolerances. A tolerance specifies the maximum error 
allowed. Fourth, each block is approximated by a low-rank approximation (LRA) that 
satisfies that block's tolerance. A number of LRA algorithms are available. 

An LRA to a block Bi can be efficiently expressed as an outer product of two 
matrices U and V: Bi ss Bi = UV T . Let r be the number of columns in U; then r is 
the maximum rank of Bi and the rank if U and V have independent columns, as is 
always the case in this context. Let miz(Bi) be the number of nonzero numbers re- 
quired to represent Bi exactly, and similarly for other matrices. Hence nnz(B) = N 2 ; 
if Bi G M mxn , then nnz(i?j) = ran and nnz(Bi) = nnz(U) + nnz(V) = (m + n)r; 
and if B is partitioned according to the index set I, then nnz(B) = X^:ex nnz(.Bj). 
The efficiency of an _ff-matrix approximation B is measured by its compression, 
nnz(i?)/nnz(_B) = A^ 2 /nnz(i3). A standard application of an .ff-matrix approximation 
is to a sequence of matrix-vector products (MVP) with dense vectors. The work to 
compute an MVP is proportional to nnz(i?). 

In this paper we focus on the third part of constructing an _ff-matrix: how to 
map the requested tolerance e on B to block tolerances. An efficient method has 
at least these three properties. First, the tolerance is met. Second, the tolerance is 
not exceeded by much: any unrequested additional accuracy reduces the compression. 
Third, the method must not add much work to the construction process relative to a 
standard method; indeed, if the resulting compression is greater, the work will likely 
be less. We propose a method based on the inequality (|3.3[) that has these properties. 
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It may be worth emphasizing the importance of an algorithm's meeting a toler- 
ance without exceeding it by much. Many computational mathematical algorithms 
trade between speed and accuracy. A tolerance has two roles: first, it is the means 
by which the user controls the tradeoff; second, the associated error bound, either by 
itself or in combination with those from other algorithms in a larger framework, gives 
the accuracy of the approximate solution to the problem. A user who detects that an 
algorithm is giving much greater accuracy than requested is tempted to request less 
accuracy. Such a procedure requires more work in the form of repeated solutions of 
trial problems until the speed-accuracy tradeoff appears — but cannot be known with 
certainty — to be satisfactory; and it sacrifices an assured error bound because the 
relationship between tolerance and error bound is not clear. The method we shall 
describe does not principally increase compression, relative to the standard method, 
for a given (unknown) achieved error (though it docs in important cases in our ex- 
periments); rather, it increases compression for a given requested error tolerance by 
achieving an error that is not much less than the tolerance. 

2. The meaning of the requested error tolerance. In this paper and as of- 
ten implemented in //-matrix software, the tolerance e specifies the maximum allowed 
Frobenius-norm-wise relative error of the approximation B to B. Let E = B — B . It 
is requested that B satisfy 

\\B — B\\ 

\\E\\f < s\\B\\ F or, cquivalently, 11 l|F < e. (2.1) 

\\ b \\f 

If (|2.1j) holds, then ||.Ex||f < II-E'IIfII^IU < £||-5||f||#||2- Rearranging, the relative error 
in an MVP is bounded as 



\Bx - Bx\ 



|5||fIW| 2 



< £• (2.2) 



If B is square and nonsingular (as it is in this application), then we can obtain a 
simple expression for the maximum perturbation to x that yields the same error as B 
makes. Let y = Bx. Let y = Bx\ we want to bound ||fe||2 such that y = B{x + 5x). 
Rearranging, Sx = B-\y-Bx) = B^iy-y). By fl^S}, \\y - y\\ 2 < e||5|| F ||z||2 and 
so \\5x\\ 2 < eWB-'MBWFWxh, or 

2 < skf{B), (2.3) 



<• 2 



where kf(B) is the condition number of B in the Frobenius norm. If (|2.1[) instead 
has the form ||.E|| F < e||-B|| 2 , then '2' replaces 'F' in and ([231) . 

One can also specify an element- wise error (EWE) of the approximation B to B. 
Let maxjj \Eij\ = ||-E|| m ax: this is the element- wise max-norm. In this case, it is 
requested that B satisfy 

WEW^KsN-^BW^e. (2.4) 

If holds, then ||i£r||i < e||a;||i||e||i = e||x||iA = e||B||i||a;||i, where e is the vector 
of all ones, and so the relative error in an MVP is bounded as 

\\Bx - Bxh 
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which is the same as (|2.2p except the 1-norm is used. Similarly, one obtains the 
analogue of (|2.3[) : 

llfelh 

INIi 

Observe that an element-wise norm (Frobenius, max) applied to blocks yields an 
error bound in p- norms (respectively 2 and 1). 

3. Meeting the requested error tolerance. Let a matrix be partitioned into 
blocks indexed by i G X; we often write ^ rather than X^ei' ^ e nee d to satisfy 
the bound (|2.1[) on the whole matrix based on information in each block. The bound 
may be achieved by at least three methods. 

The first method — really, a class of methods — is to prescribe the rank of each block 
based on analysis of problem-specific information: for example, an integral equation 
and its discretization [8]; or in the context of proving error bounds for far more 
sophisticated hierarchical algorithms than we consider: for example, the i^-matrix 
method of [3J. Our interest is in applying the relatively simple H- matrix method to 
arbitrary kernels, and so we do not further consider these approaches. 

We call the second method the block-wise relative- error method (BREM). This 
method is standard in practice: for example, equation (3.69) and surrounding text 
in PQ, the software package AHMED by the same author, numerical experiments in 
Chapter 4 of [7], and relevant experiments in Section 4.8 of [5] all use or describe this 
method. Each LRA Bi to block Bi is computed to satisfy 

II^IIf < spillF. (3.1) 

Then 

Pill = E II^IIf < E II^Hf = £2 H b IIf» (3-2) 

i i 

which implies (12.11) . 

The third method supposes ||B||f is known. We call it the matrix-wise relative- 
error method (MREM). We believe that this method, though simple, is new. Each 
LRA B t to Bi G R m * x ™* is computed to satisfy 



||^||f<^9^||B||f. (3.3) 

As N 2 = Y^iei 171 ^' 

\\E\\l = E ll^lll < ^A^USIHE™ = £2 H B !If. (3-4) 

i i 

which again implies (|2.1[) . 

If (|3.1j) and p.3[) are equalities rather than inequalities, then so are respectively 
(I3~2l and (l3~4l) . 

The efficiencies of BREM and MREM are determined by the magnitudes ||Bi||F 
and N^ 1 y/mini\\B\\F, respectively, as a function of i G X. One magnitude cannot 
dominate the other for all i £ I: as equality is possible in (|3.2[) and (|3.4j) . if for one 
block the first magnitude is greater than the other, then there is at least one other 
block for which the opposite holds. 
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MREM requires that a block LRA satisfy (on rearranging (13.310 

{rrnmy^EiWl < eN- 2 \\B\\l. (3.5) 

On each side of this inequality — omitting e on the right side — is a quantity we call the 
Frobenius norm squared per element (FNPE). If we also square and divide (|3.ip by 
miUi, then our two magnitudes from before become respectively the block and matrix 
FNPE (m.ini) _1 ||i?i||| and A~ 2 ||i?|| F . For a matrix B arising from a singular kernel, 
the block FNPE is smaller for far-off-diagonal blocks than for near- and on-diagonal 
blocks. Hence relative to BREM, MREM requests less accuracy for far-off-diagonal 
blocks and more for the others. 

BREM and MREM require different termination criteria in the block LRA al- 
gorithm. BREM requires the LRA algorithm to terminate based on an estimate 
of the relative error with tolerance e; MREM, the absolute error with tolerance 
eN^ 1 y/mini\\B\\p. Neither termination criterion is more difficult to implement, or 
requires more work to evaluate (within a small number of operations), than the other. 

So far we have discussed the error bound in the Frobenius norm; now we discuss 
the bound in the max-norm. If every block i satisfies 

\\Ei\\ max <i, (3.6) 

then (|2.4p holds. We call this procedure MREMmax. Consider the inequalities (|3.5p 
and (|3.6[) . In both, the right side is an absolute tolerance that is the same for every 
block; and the left side describes an element-wise quantity: in the first, the FNPE; in 
the second, the maximum magnitude. Hence MREM and MREMmax behave similarly 
in how they map the matrix tolerance to the block ones. It is not clear to us whether 
there are any applications that use MREMmax rather than BREM. We shall shortly 
discuss why MREM is preferable to MREMmax in practice. 

3.1. Estimating ||S||f- In some problems ||S||f may be available, but we also 
need a means to estimate this norm. We describe two methods. 

The first is a stochastic method. Let {Xi} be n iid samples. The estimator of 
the mean is, as usual, \i = n^^-J^i^-i- It is useful to estimate confidence intervals 
on the estimator \x using a resampling method. A straightforward approach is to 
compute the delete-1 jackknife variance of the estimator, which in this simple case 
is n _1 *^2i{Xi — /i) 2 . We use the square root of this value and call it the jackknife 
standard deviation (JSD). 

To estimate ||-B|||, we could select a subset of entries. In practice we choose 
to select a random subset of columns. This choice implies X* = N^2^ =1 B^. We 
increase the number of columns until the JSD is less than a requested tolerance. 

A second method is to obtain an initial approximation B to B with a very large 
tolerance e. One can use either BREM or MREM; if MREM, use the stochastic 
method to obtain the initial estimate of ||-B||f- As \\B — B\\p < £J|-B||f and B = 
B + (B — B), ||-B||f < (1 + e)||-B||f. Hence we can safely use (1 + eT^SHf as the 
estimate of ||-B||f in MREM to obtain the final approximation B. Computing ||-B||f 
requires work proportional to nnz(£>). 

3.2. Recompression. Let B € R mxn now be a block; in this subsection we 
suppress the subscript i. Let B 1 = U l (V l ) T be the rank-r output of an LRA al- 
gorithm. One can improve a suboptimal LRA by using a recompression algorithm 
PP. Such an algorithm attempts to compress B 1 ; the algorithm's efficiency results 
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from the outer-product structure of B 1 and that B is not accessed at all. For the 
latter reason, the algorithm's error bound must be based on only B . An example 
of a recompression algorithm is to compute the singular value decomposition (SVD) 
of B 1 — which is efficient because of the outer-product structure of B 1 — and then to 
discard some subset of the singular values and vectors. Let B 2 be the output of a 
recompression algorithm. 

Let e be a block absolute tolerance, as in MREM and MREMmax. We must 
assure \\B — B 2 \\ F < e; what follows also holds if 'F' is replaced by 'max'. Let 
< a < 1. First, compute B 1 so that \\B — B 1 ^ < cte. Second, compute B 2 so 
that US 1 - S 2 ||f < (1 - a)e; for generality later, let j8 = 1 - a. As B - B 2 = 
(B - B 1 ) + (B 1 - B 2 ), \\B - B 2 \\ F < \\B - B 1 ^ + {{B 1 - B 2 \\ F < as + (1 - a)e = e, 
and so B 2 satisfies the tolerance. 

Now let £ be a block relative tolerance, as in BREM. We must assure \\B — B 2 \\ F < 
£||S||f. Again, let < a < 1; we must determine f3 > 0. First, compute B 1 so that 
\\B - B^If < ae||S|| F . Second, compute B 2 so that p 1 - B 2 \\ F ^ ^p 1 ^. If 
in this second inequality the right side involved ||B||f rather than Ij-B^jp, then the 
calculation would be the same as in the case of absolute error. But here we must 
bound WB 1 ^. As B 1 = (B 1 -B) + B, P^f < \\B - B 1 ^ + \\B\\ F . From the first 
step, IIB-S^If < ae\\B\\ F , andsoJlS 1 ^ < (l + ae)||B|| F . Let /3 = (l-a)/(l + ae). 
Then ||B-B 2 i| F < \\B - B 1 \\ F + WB 1 - B 2 \\ F < ae\\B\\ F + ^£||B 1 || F < ae||j9|| F + (1 - 
a)£||B|| F = e||S|| F , and so again B 2 satisfies the tolerance. 

In the case of SVD recompression, following Section 1.1.4 of [I], let QjjRu be 
the thin QR factorization of U 1 and similarly for V 1 . Let WY>Z T be the SVD of 
RuR^. Then Q U WY,Z T Q^ is the thin SVD of B 1 . This SVD is obtained with 
0({m + n)r 2 + r 3 ) work. Choose the first k < r columns of W and V and singular 
values <Tj and set U 2 — C}i \\ : .\ : i,Y. and V 2 = Q\rZ:,i:k- This gives the LRA B 2 = 
U 2 (V 2 ) T ; nnz(B 2 ) < nnz(_B 1 ), with strict inequality if k < r. When one uses MREM 
and so the Frobenius norm, one chooses k based on the inequality \\B 2 — B 1 ^ < 
f3e, where B 2 uses the k largest singular values and associated vectors. Because 
\\B 2 — B X \\ F = ff, 2 ) 1 ' 2 , choosing k requires just 0(r) operations. Using 

MREMmax entails more work because there is no simple relationship between the 
singular values and the max-norm. One performs a binary search on the ordered list of 
singular values. At each step, all the elements of B 2 must be computed and compared 
with B 1 to find the maximum deviation. Hence selecting k requires 0(mn log r) work; 
this work can dominate the work to compute the SVD if r is small relative to m and 
n. Because SVD recompression is a very effective way to improve an LRA, MREM is 
preferable to MREMmax if one does not prefer the 1- to the 2-norm in the associated 
error bounds. 

4. Numerical experiments. We have several goals for our numerical exper- 
iments. First, of course, we must show that MREM can yield greater compression 
than BREM on at least some problems of interest. Second, we want to show that 
MREM is robust: ideally, it should never yield less compression than BREM. Third, 
we are interested in the errors that MREM achieves: are they indeed only slightly 
better than requested? 

We use adaptive cross approximation (ACA) [2] as implemented in AHMED to 
find block LRA. To implement the absolute block tolerance required by MREM, we 
modified two lines of code. A good LRA algorithm must compress a matrix well while 
requesting as few matrix entry evaluations as possible. We find that ACA is quite 
efficient and robust on the problems in our test set. It terminates when the relative 
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error is within a factor of approximately 10 either side of the tolerance; in particular, 
note carefully that occasionally (but infrequently) the error is worse than requested. 
Of course ACA could terminate with a more precise error, but increased termination 
precision would entail additional work. Instead, we accommodate this behavior as 
follows. Let e be the block tolerance. In the recompression tolerance we set a = 1/2; 
if MREM is used, = a; if BREM, $ is as described in Section [321 We run ACA 
with a tolerance of ae/10, where the factor of 1/10 compensates for ACA's sometimes 
imprecise termination behavior, and then the SVD recompression algorithm with f3e. 
LRA algorithms based on interpolation (e.g., [1]) mitigate the problem of terminating 
with less accuracy than requested while still remaining efficient. In any case, the 
choice of LRA algorithm is independent of the choice of either BREM or MREM. 

We estimate ||B||f using the stochastic method described in l3.ll The estimator is 
terminated when the JSD is no greater than 1/50. To be conservative, we subtract two 
JSD from the estimate of ||-B||f, thereby effectively decreasing the tolerance slightly. 

4.1. Particle distributions. We perform numerical experiments using a dis- 
tribution of point-like particles that interact according to the kernel K(r) — l/r p if 
r > 0, otherwise, where r is the distance between two points and p is the order of 
the singularity. We also consider the kernel K(r) = lnr. 

We consider singularity orders p = 1, 2, 3 and three geometries: uniform distribu- 
tion of points in the cube [— 1, l] 3 , on its surface, and on its edges; where indicated in 
figures, these are denoted respectively 'cube', 'surf, and 'edge'. The requested error 
tolerance e is usually 10 -5 . We measure several quantities. The one that most clearly 
demonstrates the performance of MREM relative to BREM is the improvement factor 
(IF), nnz(I? B )/nnz(i? M ), where the superscripts B and M denote BREM and MREM. 
A value greater than one indicates improvement. 

Figures 14.11 14.21 and 14.31 show results for the three geometries. Each column 
corresponds to a kernel, which is stated at the top of the column, and tolerance e, 
indicated by the dashed horizontal line in the top row. In the top two rows, curves 
for BREM have circles; for MREM, it's. All rows plot quantities against problem size 
N. The top row shows the Frobenius-norm-wise relative error achieved; these are 
estimates for problems having size N > 2 15 . The middle row shows the compression 
factor. The bottom row shows the improvement factor of MREM. Most experiments 
were carried out to approximately 2 17 particles, though in a few cases the runs were 
terminated early to save computation time or because of memory size. 

Trends accord with our observations in Section [3l The IF increases as the domi- 
nance of the (near) diagonal of B increases. In the context of our test problems, the 
IF increases with increasing order p and sparser geometry. The IF rarely is below 1, 
and then by only an extremely small amount and only on problems whose singularity 
is of order 1. For both BREM and MREM, achieved errors are always below the 
requested tolerance; and MREM's achieved errors are almost always within a factor 
of 10 of it. 

Figure 14.41 shows the behavior of the two methods for p = 2,3 and the three 
geometries when the requested tolerance is varied. The first three rows are as in 
the earlier figures except that the abscissa is the tolerance rather than problem size. 
For easy reference, in the first row requested tolerance is also indicated by the line 
having dots. The fourth row plots compression against achieved error. On these test 
problems, MREM achieves greater compression than BREM even when the achieved 
errors are the same. 
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Fig. 4.1. Particles are distributed uniformly on the edges of a cube. 

4.2. A problem in geophysics. The problem that motivated this work is one 
in geophysics. We are modeling the evolution of slip on a fault in an elastic half space. 
One part of the computation is a matrix- vector product relating slip on the fault to 
stress. The fault is discretized by rectangular patches, and stress at the center of all 
patches due to unit slip on one patch is computed using the rectangular dislocation 
code of jBj. The matrix is approximated by an if-matrix. We performed a test on a 
fault discretized by 156 x 402 rectangles with a tolerance of 10 -5 . For respectively 
BREM and MREM, the actual errors arc 1.28 x 10~ 8 and 1.71 x 10" 6 ; the compression 
factors are 16.89 and 75.74, for an improvement factor of 4.48. 

5. Conclusions. For many problems, MREM produces a more efficient approx- 
imation than BREM for a requested error tolerance by producing an approximation 
B that is little more accurate than is requested. MREM rarely produces a less ef- 
ficient approximation, and then only by a small amount. Improvement factors are 
often between 1.5 and 5. The improvement factor increases with the dominance of 
the diagonal. For problems in which the order of the singularity is 1, MREM's results 
are almost exactly the same as BREM's. For higher-order problems, MREM is con- 
sistently better by a substantial and practically useful amount. MREM is also better 
than BREM for high-order singularities even when achieved, rather than requested, 
error is considered. 

Using MREM rather than BREM requires the extra work to estimate ||-B||f if it 
is not known. In our experiments, ||-B||f is estimated in a time that is approximately 
1% of the total construction time. Because the work to find a block LRA increases 
with the block tolerance, MREM is faster than BREM on any problem in which the 
resulting compression is greater by at least a small amount. 
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Fig. 4.2. Particles are distributed uniformly on the surface of a cube. 
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Fig. 4.3. Particles are distributed uniformly inside a cube. 
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Fig. 4.4. The requested tolerance e is varied. The number of particles is approximately N = 2 13 . 

The similarity in block tolerances suggests that the compression factors resulting 
from MREM and MREMmax can be expected to scale with problem size and 
tolerance similarly. However, SVD recompression is far more efficient when using 
MREM. We conclude that one should consider using MREM when a problem's 
singularity has order greater than 1. 
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